DataCamp - Pandas Foundations

DataCamp course notes on pandas foundations.

Data Ingestion & Inspection

Basic Concepts of Pandas DataFrames

A pandas DataFrame is a tabular data structure with columns and rows. Rows are labeled by a data structure called an index: a list of labels that permits fast lookup and powerful relational operations.

Columns are also labeled by an index. Each column is a specialized data structure called a pandas Series, which is a 1-d labeled numpy array; a DataFrame is correspondingly a 2-d labeled numpy array.

Each Series has its own .head() method and inherits its name attribute from its column label. To extract the numerical values of a Series as a numpy array, use the .values attribute.
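A minimal sketch of these attributes, using a small stand-in DataFrame (the numbers here are made up for illustration):

import pandas as pd

# A tiny stand-in for the AAPL data used below (hypothetical values)
AAPL = pd.DataFrame({'open': [100.0, 101.5, 102.0],
                     'close': [101.0, 102.5, 101.8]})

close = AAPL['close']  # each column is a pandas Series
print(close.head())    # the Series has its own .head() method
print(close.name)      # 'close' -- inherited from the column label
print(close.values)    # the underlying 1-d numpy array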

Assigning NaNs to every third row of the last column:

import numpy as np
AAPL.iloc[::3, -1] = np.nan

Create DataFrames

Creating a DataFrame by combining lists

import pandas as pd

# list_keys and list_values are pre-defined lists from the exercise
# Zip the 2 lists together into one list of (key, value) tuples: zipped
zipped = list(zip(list_keys, list_values))

# Inspect the list using print()
print(zipped)

# Build a dictionary with the zipped list: data
data = dict(zipped)

# Build and inspect a DataFrame from the dictionary: df
df = pd.DataFrame(data)
print(df)

# Rename the DataFrame columns
list_labels = ['year', 'artist', 'song', 'chart weeks']
df.columns = list_labels

Creating a DataFrame from a dictionary

# Make a string with the value 'PA': state
state = 'PA'

# cities is a pre-defined list of city names from the exercise
# Construct a dictionary: data (the scalar state is broadcast across all rows)
data = {'state': state, 'city': cities}

# Construct a DataFrame from dictionary data: df
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

Importing and Exporting Data

col_names = ['a', 'b', 'c', 'd']
sunspots = pd.read_csv(filepath,
                       header = None,  # prevents pandas from treating the first line as column labels; header = 0 means the first row is the header
                       names = col_names,
                       na_values = {'sunspots': ['-1']},  # replace every -1 in the sunspots column with NaN
                       parse_dates = [[0, 1, 2]],  # combine the three columns (year, month, day) at positions 0, 1, 2 into one datetime column; alternatively, parse_dates = True
                       delimiter = ' ',
                       comment = '#',  # text after # on a line is treated as a comment
                       index_col = 0)  # use the first column as the row index
sunspots.to_csv('sunspots.tsv', sep = '\t')
sunspots.to_excel('sunspots.xlsx')

Plotting the data

import matplotlib.pyplot as plt
aapl.plot()  # plots all columns at once against the row index
aapl[['open', 'close']].plot(color = 'b',   # blue
                             style = '.-',  # or '.', '-'
                             legend = True)
plt.yscale('log')  # log y axis
plt.axis(('2001', '2002', 0, 100))  # x limits (dates) and y limits
plt.savefig('aapl.png')  # the suffix can also be .jpg or .pdf
plt.show()

Exploratory Data Analysis

Iris data set: 150 observations, 4 features each

  • Sepal length, width, Petal length and width.
  • 3 Species: setosa, versicolor, virginica

Visual EDA

import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv('iris.csv', index_col = 0)
iris.shape
iris.head()
iris.plot(kind = 'scatter', x = 'sepal_length', y = 'sepal_width')
plt.show()

# Make subplots (df here is a different dataset, with weight and mpg columns)
cols = ['weight', 'mpg']
df[cols].plot(kind = 'box', subplots = True)
plt.show()

  • Scatter plot: kind = 'scatter'
  • Boxplot: kind = 'box'
  • Histogram: kind = 'hist', with options:
    1. bins (integer): number of intervals or bins
    2. range (tuple): extrema of the bins (min, max)
    3. normed (boolean): whether to normalize to one
    4. cumulative (boolean): compute the Cumulative Distribution Function (CDF)

Note: There are three different DataFrame plot idioms:

  • iris.plot(kind = 'hist')
  • iris.plot.hist()
  • iris.hist()

Their syntax and results also differ from each other.
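A quick sketch of the three idioms, assuming iris has been loaded as above:

import matplotlib.pyplot as plt

iris.plot(kind = 'hist', alpha = 0.5)  # all numeric columns on one set of axes
plt.show()

iris.plot.hist(alpha = 0.5)  # the same plot via the .plot accessor
plt.show()

iris.hist()  # a grid of subplots, one per numeric column
plt.show()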

Formatting plots; plotting the PDF and CDF

# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows = 2, ncols = 1)

# Plot the PDF
df.fraction.plot(ax = axes[0],  # plot appears in the first row
                 kind = 'hist',
                 normed = True,  # needed for a PDF
                 bins = 30,
                 range = (0, .3))

# Plot the CDF
df.fraction.plot(ax = axes[1],  # plot appears in the second row
                 kind = 'hist',
                 normed = True,
                 cumulative = True,  # needed for a CDF
                 bins = 30,
                 range = (0, .3),
                 alpha = 0.3)  # transparency
plt.show()

# Plot a boxplot with filtering
titanic.loc[titanic['pclass'] == 1].plot(ax = axes[0], y = 'fare', kind = 'box')

Statistical EDA

.describe() for numeric columns reports: count, mean, std, min, 25%, median (50%), 75%, max.
.describe() for categorical columns reports:

  • count: the number of non-null entries
  • unique: the number of distinct values
  • top: the most frequent category
  • freq: the number of occurrences of top

Other useful methods:

  • .unique() returns the unique values in a column
  • .mean(): .mean(axis = 'columns') computes the mean across all columns per row
  • .std()
  • .median()
  • .quantile(q), where q is in [0, 1]. To calculate the inter-quartile range (IQR), use q = [0.25, 0.75]. A boxplot is a visual way to understand the quartiles.
  • .min(), .max()
  • .count()
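For instance, .describe() and .quantile() can be combined to compute the IQR of each numeric column (a sketch, assuming iris is loaded as above):

print(iris.describe())  # summary statistics for the numeric columns

# Inter-quartile range per column
q = iris.quantile([0.25, 0.75], numeric_only = True)
iqr = q.loc[0.75] - q.loc[0.25]
print(iqr)

print(iris['species'].unique())  # the distinct species labels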

Filtering by species

import numpy as np

indices = iris['species'] == 'setosa'
setosa = iris.loc[indices, :]

describe_all = iris.describe()
describe_setosa = setosa.describe()
error_setosa = 100 * np.abs(describe_setosa - describe_all) / describe_setosa

Time Series in Pandas

The read_csv() function can parse strings into datetime objects with the argument parse_dates = True. Dates are stored as yyyy-mm-dd hh:mm:ss. For time series data it is also useful to set the date column as the index, so that we can slice by date and time easily.

import pandas as pd
sales = pd.read_csv('sales-feb-2015.csv',
                    parse_dates = True,
                    index_col = 'Date')
sales.loc['2015-02-19 11:00:00', 'Company']
sales.loc['2015-2-5']  # select all rows on this day
# Alternatively
sales.loc['February 5, 2015']
sales.loc['2015-Feb-5']

sales.loc['2015-2']  # select the entire month
sales.loc['2015']    # select the entire year

sales.loc['2015-2-5':'2015-2-10']  # time window

# Convert strings to pandas datetime objects
time_format = '%Y-%m-%d %H:%M'
evening_2_11 = pd.to_datetime(['2015-2-11 20:00',
                               '2015-2-11 21:00',
                               '2015-2-11 22:00',
                               '2015-2-11 23:00'],
                              format = time_format)

# Reindexing the DataFrame
sales.reindex(evening_2_11)  # return the rows whose index matches these dates; unmatched dates produce rows of NaN
sales.reindex(evening_2_11, method = 'ffill')  # forward fill the NaNs with the previous record
sales.reindex(evening_2_11, method = 'bfill')  # backward fill the NaNs with the next record
sales.reindex(evening_2_11).interpolate(method = 'linear')  # fill NaNs with linearly interpolated values

Resampling pandas time series

Resampling is a statistical method applied over different time intervals.
Downsampling reduces datetime rows to a slower frequency (e.g. hourly to daily).
Upsampling increases datetime rows to a faster frequency (e.g. daily to hourly).
The most frequently used resampling frequencies:

  • 'min', 'T': minute
  • 'H': hour
  • 'D': day
  • 'B': business day
  • 'W': week
  • 'M': month
  • 'Q': quarter
  • 'A': year

Downsampling

# Aggregation

daily_mean = sales.resample('D').mean()  # 'D' represents day: daily average sales

daily_sum = sales.resample('D').sum().max()  # highest daily total

sales.resample('W').count()  # weekly count of sales records

sales.loc[:, 'Units'].resample('2W').sum()  # sums over 2-week intervals

population.resample('A').first()  # the population at yearly intervals
population.resample('A').first().interpolate(method = 'linear')  # fill the NaNs with linearly interpolated values

Upsampling
The example below upsamples daily records to 4-hour intervals.

two_days = sales.loc['2015-2-4':'2015-2-5', 'Units']
two_days.resample('4H').ffill()  # resample to a 4-hour frequency and forward fill the resulting NaNs

Rolling time window by .rolling(). For example, with a Series hourly_data, hourly_data.rolling(window=24).mean() would compute new values for each hourly point, based on a 24-hour window stretching out behind each point. The frequency of the output data is the same: it is still hourly. Such an operation is useful for smoothing time series data.

# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-Aug-01':'2010-Aug-15']

# Apply a rolling mean with a 24 hour window: smoothed
smoothed = unsmoothed.rolling(window = 24).mean()

# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})

# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
plt.show()

Manipulating pandas time series

  1. String methods
    .str.upper() converts strings to upper case.
    .str.contains('a') checks for the substring 'a' and returns a Boolean Series, so .str.contains('a').sum() counts the records that contain 'a'.
    .str.strip() strips extra whitespace.
  2. Datetime methods
    .dt.hour extracts the hour of each datetime value as a Series.
    .dt.tz_localize('US/Central') and .dt.tz_convert('US/Eastern') handle time zones: first localize, then convert.
    For example, combining separate date and time columns into datetimes:

    times_tz_none = pd.to_datetime( la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'] )
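A short sketch of both method families on made-up data (all values here are hypothetical):

import pandas as pd

# String methods on a Series of company names
company = pd.Series([' Acme ', 'hooli', 'Initech'])
print(company.str.upper())              # upper-case every string
print(company.str.contains('e').sum())  # count of entries containing 'e'
print(company.str.strip())              # strip surrounding whitespace

# Datetime methods: localize to a time zone first, then convert
times = pd.Series(pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00']))
print(times.dt.hour)                          # hour component as a Series
central = times.dt.tz_localize('US/Central')  # attach a time zone
print(central.dt.tz_convert('US/Eastern'))    # convert to another zone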

Visualizing pandas time series

Style format string:

  • color ('k': black), ('b': blue), ('g': green), ('r': red), ('c': cyan)
  • marker ('.': dot), ('o': circle), ('*': star), ('s': square), ('+': plus)
  • line type ('-': solid), (':': dotted), ('--': dashed)

Area plot: kind = 'area'

To plot columns in separate plots: subplots = True, as in the sketch below.
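A sketch combining these options on hypothetical daily data:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily prices, just to demonstrate the plot styles
idx = pd.date_range('2015-01-01', periods = 10, freq = 'D')
prices = pd.DataFrame({'open': range(10), 'close': range(1, 11)}, index = idx)

prices.plot(style = 'k.-', legend = True)  # black, dot markers, solid line
plt.show()

prices.plot(kind = 'area')  # area plot
plt.show()

prices.plot(subplots = True)  # each column in its own subplot
plt.show()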

Before plotting time series data, make sure to set the index to a datetime object.

# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df.Date)

# Set the index to be the converted 'Date' column
df.set_index('Date', inplace = True) #inplace = True will automatically replace the original index without further assignment

# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
plt.show()
plt.clf() #clear the current figure

Case Study

import pandas as pd
df_headers = pd.read_csv('data.csv', header = None)  # header = None: the file has no header row
print(df_headers.head())

pandas has been imported in the workspace as pd, and the file NOAA_QCLCD_2011_hourly_13904.txt has been parsed and loaded into a DataFrame df. The comma separated string of column names, column_labels, and list of columns to drop, list_to_drop, have also been loaded for you.

# Split on the comma to create a list: column_labels_list
column_labels_list = column_labels.split(',')
df.columns = column_labels_list

# Remove the appropriate columns: df_dropped
df_dropped = df.drop(list_to_drop, axis = 'columns')
print(df_dropped.head())

Convert the index into a datetime object

# Convert the date column to string: df_dropped['date']
df_dropped['date'] = df_dropped['date'].astype(str)

# Pad leading zeros to the Time column: df_dropped['Time']
df_dropped['Time'] = df_dropped['Time'].apply(lambda x: '{:0>4}'.format(x))  # pad with leading zeros so each value is 4 digits

# Concatenate the new date and Time columns: date_string
date_string = df_dropped.date + df_dropped.Time

# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format = '%Y%m%d%H%M')

# Set the index to be the new date_times container: df_clean
df_clean = df_dropped.set_index(date_times)

# Print the output of df_clean.head()
print(df_clean.head())

Convert temperature columns into numeric values

# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df_clean['dry_bulb_faren'] = pd.to_numeric(df_clean['dry_bulb_faren'], errors = 'coerce')

# Convert the wind_speed and dew_point_faren columns to numeric values
df_clean['wind_speed'] = pd.to_numeric(df_clean['wind_speed'], errors = 'coerce')
df_clean['dew_point_faren'] = pd.to_numeric(df_clean['dew_point_faren'], errors = 'coerce')

Now that the data is clean, we can start EDA.
Statistical EDA:

# Print the median of the dry_bulb_faren column
print(df_clean.dry_bulb_faren.median())

# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df_clean.loc['2011-Apr':'2011-Jun', 'dry_bulb_faren'].median())

# Print the median of the dry_bulb_faren column for the month of January
print(df_clean.loc['2011-Jan', 'dry_bulb_faren'].median())

.reset_index() resets the index to the default 0-based integer index.

# Downsample df_clean by day and aggregate by mean: daily_mean_2011
daily_mean_2011 = df_clean.resample('D').mean()

# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values

# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample('D').mean()

# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index().Temperature #note that we reset the index to start from 0 so that we can calculate the difference between two series in the next step

# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())

Sunny or cloudy?

# Select days that are sunny: sunny
sunny = df_clean[df_clean.sky_condition == 'CLR']

# Select days that are overcast: overcast
overcast = df_clean[df_clean.sky_condition.str.contains('OVC')]

# Resample sunny and overcast, aggregating by maximum daily temperature
sunny_daily_max = sunny.resample('D').max()
overcast_daily_max = overcast.resample('D').max()

# Print the difference between the mean of sunny_daily_max and overcast_daily_max
print(sunny_daily_max.mean() - overcast_daily_max.mean())

Correlation between visibility and temperature:

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility','dry_bulb_faren']].resample('W').mean()

# Print the output of weekly_mean.corr()
print(weekly_mean.corr())

# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots = True)
plt.show()

Daily hours of clear sky

# Create a Boolean Series for sunny days: sunny
sunny = df_clean.sky_condition == 'CLR'

# Resample the Boolean Series by day and compute the sum: sunny_hours
sunny_hours = sunny.resample('D').sum()

# Resample the Boolean Series by day and compute the count: total_hours
total_hours = sunny.resample('D').count()

# Divide sunny_hours by total_hours: sunny_fraction
sunny_fraction = sunny_hours / total_hours

# Make a box plot of sunny_fraction
sunny_fraction.plot(kind = 'box')
plt.show()

Heat or humidity
Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 is considered uncomfortable while a temperature above 90 is also considered uncomfortable.

In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are 'dew_point_faren' and 'dry_bulb_faren'. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots. Uncomfortably, you will notice that the maximum dew point is above 65 every month!

# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df_clean[['dew_point_faren', 'dry_bulb_faren']].resample('M').max()

# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind = 'hist', bins = 8, alpha = 0.5, subplots = True)

# Show the plot
plt.show()

Probability of high temperatures
We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.

The two DataFrames df_clean and df_climate are available in the workspace. Your job is to select the maximum temperature in August in df_climate, and then maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.

Once you’ve generated the CDF, notice how it shows that there was a 50% probability of the 2011 daily maximum temperature in August being 5 degrees above the 2010 climate normal value!

# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df_climate.loc['2010-Aug','Temperature'].max()
print(august_max)

# Resample the August 2011 temperatures in df_clean by day and aggregate the maximum value: august_2011
august_2011 = df_clean.loc['2011-Aug', 'dry_bulb_faren'].resample('D').max()

# Filter out days in august_2011 where the value exceeded august_max: august_2011_high
august_2011_high = august_2011[august_2011 > august_max]

# Construct a CDF of august_2011_high
august_2011_high.plot(kind = 'hist', cumulative = True, normed = True, bins = 25)

# Display the plot
plt.show()